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Abstract 



O 

(N 

' This paper discusses the topic of dimensionahty reduction for fc-means clustering. We prove that 

O , any set of n points in d dimensions (rows in a matrix A e R"^'^) can be projected into t = fl{k/e^) 

■ dimensions, for any e e (0, 1/3), in 0(n(i[e^^fc/ log((i)] ) time, such that with constant probability 
the optimal /c -partition of the point set is preserved within a factor of 2 + e. The projection is done 

■ by post-multiplying A with a d x t random matrix R having entries +l/^/t or with equal 
probability. A numerical implementation of our technique and experiments on a large face images 
dataset verify the speed and the accuracy of our theoretical results. 

< 

c/3 ! 1 Introduction 

The /c-means clustering algorithm jJISl was recently recognized as one of the top ten data mining tools of the last fifty 
years |20|. In parallel, random projections (RP) or the so-called Johnson-Lindenstrauss type embeddings 1 12| became 
^ ' popular and found applications in both theoretical computer science [2J and data analytics |4 |. This paper focuses on 

CsJ I the application of the random projection method (see Section |273] i to the fc-means clustering problem (see Definition 

■ [U. Formally, assuming as input a set of n points in d dimensions, our goal is to randomly project the points into d 
dimensions, with d <^ d, and then apply a fc-means clustering algorithm (see Definition|2|i on the projected points. Of 
course, one should be able to compute the projection fast without distorting significantly the "clusters" of the original 
point set. Our algorithm (see Algorithm[T]l satisfies both conditions by computing the embedding in time linear in the 
size of the input and by distorting the "clusters" of the dataset by a factor of at most 2 + e, for some e e (0, 1/3) (see 
Theorem[T]i. We believe that the high dimensionality of modern data will render our algorithm useful and attractive in 
many practical applications |i9j. 

Dimensionality reduction encompasses the union of two different approaches: feature selection, which embeds the 

■ points into a low-dimensional space by selecting actual dimensions of the data, and feature extraction, which finds an 
5-H ' embedding by constructing new artificial features that are, for example, linear combinations of the original features. 

Let j4 be an n X d matrix containing n d-dimensional points (^(i) denotes the i-th point of the set), and let fc be 
the number of clusters (see also Section IZ21 for more notation). We slightly abuse notation by also denoting by A 

the n-point set formed by the rows of A. We say that an embedding f : A ^ with /(v4(i)) = for all 
i £ [n] and some d < d, preserves the clustering structure of A within a factor 0, for some > 1, if finding an 
optimal clustering in A and plugging it back to A is only a factor of cj) worse than finding the optimal clustering 
directly in A. Clustering optimality and approximability are formally presented in Definitions [T] and |2l respectively. 
Prior efforts on designing provably accurate dimensionality reduction methods for fc-means clustering include: ( /) the 
Singular Value Decomposition (SVD), where one finds an embedding with image i = L/feEfc e such that the 

clustering structure is preserved within a factor of two; ( ii) random projections, where one projects the input points into 
t — ri(log(77)/e^) dimensions such that with constant probability the clustering structure is preserved within a factor 
of 1 + £ (see Section l23]) : (Hi) SVD-based feature selection, where one can use the SVD to find c = il{k log(fc/e)/e^) 
actual features, i.e. an embedding with image A e M" ^ ^ containing (rescaled) columns from A, such that with constant 
probability the clustering structure is preserved within a factor of 2 + e. These results are summarized in Table [T] A 
head-to-head comparison of our algorithm with existing results allows us to claim the following improvements: (ij 
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Year 


Ref. 


Description 


Dimensions 


Time 


Accuracy 


1999 


|6| 


SVD - feature extraction 


k 


0{ndmin{n, d\) 


2 




Folklore 


RP - feature extraction 


n(log{n)/e-') 


0(ndre-Mog(n)/log(d)l) 


l + s 


2009 


|5| 


SVD - feature selection 


n{k\ogik/e)/e') 


0{ndmin{ji, d}) 


2 + e 


2010 


This paper 


RP - feature extraction 


n{k/e^) 


Oind\e-'^k/log{d)']) 


2 + e 



Table 1 : Dimension reduction methods for fc-means. In the RP methods the construction is done with random sign matrices and 
the mailman algorithm (see Sections [2.3l and [3m respectively). 



reduce the running time by a factor of min{n, d} [e^ log((i) /fc] , while losing only a factor of e in the approximation 
accuracy and a factor of 1/e^ in the dimension of the embedding; (ii) reduce the dimension of the embedding and 
the running time by a factor of log(ri.)/A: while losing a factor of one in the approximation accuracy; (Hi) reduce the 
dimension of the embedding by a factor of \og{k/e) and the running time by a factor of min{n, d} [e^ log(d) / k~\ , 
respectively. Finally, we should point out that other techniques, for example the Laplacian scores flU\ or the Fisher 
scores |7|, are very popular in applications (see also surveys on the topic lISl fTSll ). However, they lack a theoretical 
worst case analysis of the form we describe in this work. 

2 Preliminaries 

We start by formally defining the A:-means clustering problem using matrix notation. Later in this section, we precisely 
describe the approximability framework adopted in the /c-means clustering literature and fix the notation. 

Definition 1. [The k-means clustering problem] 

Given a set of n points in d dimensions ( rows in an n x d matrix A) and a positive integer k denoting the number of 
clusters, find the n x k indicator matrix Xopt such that 

Xopt^a.igmm\\A-XX^A\\l. (!) 

Here X denotes the set of all n x fc indicator matrices X. The functional F{A, X) = \ A - XX'^ A\\ is the so-called 
fc-means objective function. An n x k indicator matrix has exactly one non-zero element per row, which denotes 
cluster membership. Equivalently, for all i — 1, . . . ,n and j ~ 1, . . . , fc, the i-th point belongs to the j-th cluster if 
and only if Xij — 1/ ^/z], where Zj denotes the number of points in the corresponding cluster. Note that X^ X = Ik, 
where Ik is the k x k identity matrix. 

2.1 Approximation Algorithms for fc-means clustering 

Finding Xopt is an NP-hard problem even for fc = 2 |'3^|, thus research has focused on developing approximation 
algorithms for fc-means clustering. The following definition captures the framework of such efforts. 

Definition 2. [k-means approximation algorithm] 

An algorithm is a "^-approximation" for the k-means clustering problem (j > 1) if it takes inputs A and fc, and 
returns an indicator matrix X^ that satisfies with probability at least 1 — d^, 

II A - XyXjAWl < 7 min \\A- XX^Alll . (2) 
II ' 7 iiF xex " 

In the above, d-y G [0, 1) is the failure probability of the ^-approximation k-means algorithm. 

For our discussion, we fix the 7-approximation algorithm to be the one presented in llT4l . which guarantees 7 = 1 + 2' 

for any e' G (0, 1] with running time 0(2''''/'^'' ^ dn). 

2.2 Notation 

Given an n x d matrix A and an integer fc with fc < mm{n,d}, let Uk G M"^*^ (resp. Vk G R'^^'^') be the matrix 
of the top fc left (resp. right) singular vectors of A, and let Efe G M'^^'"' be a diagonal matrix containing the top 
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k singular values of A in non-increasing order. If we let p be the rank of A, then Ap^k is equal to ^ — ^fe, with 
= Uk'^kV/J ■ By we denote the j-th row of A. For an index i taking values in the set {1, . . . , n} we write 
i e [n]. We denote, in non-increasing order, the non-negative singular values of A by <Ji{A) with i e [p\. \\A\\^ and 
||y4||2 denote the Frobenius and the spectral norm of a matrix A, respectively. A'' denotes the pseudo-inverse of A, i.e. 
the unique d x n matrix satisfying A = AA^f A, A'' AA^f = A^, {AA''^ = AA\ and {A^ A^ = A^f A. Note also 
that 11^^112 — <^ii^'^) — l/fp(^) ^nd ||A||2 = ''■i(^) = l/fp(^^)- A. useful property of matrix norms is that for 
any two matrices C and T of appropriate dimensions, ||Cr||p < ||C||p||T||2; this is a stronger version of the standard 
submultiplicavity property. We call P a projector matrix if it is square and — P. We use E [Y] and Var [Y] to take 
the expectation and the variance of a random variable Y and P (e) to take the probability of an event e. We abbreviate 
"independent identically distributed" to "i.i.d." and "with probability" to "w.p.". Finally, all logarithms are base two. 

2.3 Random Projections 

A classical result of Johnson and Lindenstrauss states that any n-point set in d dimensions - rows in a matrix A E R"^'' 
- can be linearly projected into t ~ r2(log(n)/e^) dimensions while preserving pairwise distances within a factor of 
1 ± e using a random orthonormal matrix (12] . Subsequent research simplified the proof of the above result by showing 
that such a projection can be generated using ad x t random Gaussian matrix R, i.e., a matrix whose entries are i.i.d. 
Gaussian random variables with zero mean and variance |11 1. More precisely, the following inequality holds 

with high probability over the randomness of R, 

(1 - e) II - II2 < ||^(,oi? - A,)i?||2 < (1 + e) - II2 . (3) 

Notice that such an embedding A = AR preserves the metric structure of the point-set, so it also preserves, within a 
factor of 1 + £, the optimal value of the /c-means objective function of A. Achlioptas proved that even a (rescaled) 
random sign matrix suffices in order to get the same guarantees as above HI, an approach that we adopt here (see step 
two in Algorithm[T]l- Moreover, in this paper we will heavily exploit the structure of such a random matrix, and obtain, 
as an added bonus, savings on the computation of the projection. 

3 A random-projection-type A;-means algorithm 

Algorithm[T]takes as inputs the matrix A e M"^'', the number of clusters k, an error parameter e S (0, 1/3), and some 
7-approximation /c-means algorithm. It returns an indicator matrix determining a fc-partition of the rows of A. 



Input: n X d matrix A (n points, d features), number of clusters fc, error parameter e G (0, 1/3), and 
7-approximation /c-means algorithm. 

Output: Indicator matrix X;y determining a A: -partition on the rows of A. 

1. Set t = Vlik/e^), i.e. set t = to > ck/e^ for a sufficiently large constant c. 

2. Compute a random d x t matrix R as follows. For all i S [d], j E [t] 

^ r+i/Vt,w.p. 1/2, 

\-l/Vt,w.p. 1/2. 

3. Compute the product A = AR. 

4. Run the 7-approximation algorithm on A to obtain X^; Return the indicator matrix Xj 



Algorithm 1: A random projection algorithm for fc-means clustering. 



3.1 Running time analysis 

Algorithm [1] reduces the dimensions of A by post-multiplying it with a random sign matrix R. Interestingly, any 
"random projection matrix" R that respects the properties of Lemma|2]with t = r2(fc/e^) can be used in this step. If R 
is constructed as in Algorithm[Tl one can employ the so-called mailman algorithm for matrix multiplication 11151 and 
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compute the product AR in 0{nd\e~'^k/ log(d)] ) time. Indeed, the mailman algorithm computes (after preprocessing 
Q) a matrix-vector product of any d-dimensional vector (row of A) with an d x log{d) sign matrix in 0{d) time. 
By partitioning the columns of our d x t matrix R into \t/\og{d)~\ blocks, the claim follows. Notice that when 
k — 0{\og{d)), then we get an - almost - linear time complexity 0{nd/e^). The latter assumption is reasonable in our 
setting since the need for dimension reduction in /c-means clustering arises usually in high-dimensional data (large d). 
Other choices of R would give the same approximation results; the time complexity to compute the embedding would 
be different though. A matrix where each entry is a random Gaussian variable with zero mean and variance 
would imply an 0{knd/e'^) time complexity (naive multiplication). In our experiments in Section|5]we experiment 
with the matrix R described in Algorithm [T] and employ MatLab's matrix-matrix BLAS implementation to proceed 
in the third step of the algorithm. We also experimented with a novel MatLab/C implementation of the mailman 
algorithm but, in the general case, we were not able to outperform MatLab's built-in routines (see section ISlZt . 

Finally, note that any 7-approximation algorithm may be used in the last step of Algorithm [T] Using, for example, 
the algorithm of jT4| with 7 = 1 + e would result in an algorithm that preserves the clustering within a factor of 
2 + e, for any e G (0, 1/3), running in time 0{nd\e^'^k/ log((i)] + 2^'^/^)^ ^ kn/e^). In practice though, the Lloyd 
algorithm |fT6l ITtI is very popular and although it does not admit a worst case theoretical analysis, it empirically 
does well. We thus employ the Lloyd algorithm for our experimental evaluation of our algorithm in Section |5] Note 
that, after using the proposed dimensionality reduction method, the cost of the Lloyd heuristic is only 0(nfc^/e^) per 
iteration. This should be compared to the cost of 0{knd) per iteration if applied on the original high dimensional data. 



4 Main Theorem 



Theorem [T] is our main quality-of-approximation result for Algorithm [T] Notice that if 7 = 1, i.e. if the fc-means 
problem with inputs A and k is solved exactly, Algorithm[T]guarantees a distortion of at most 2 + e, as advertised. 

Theorem 1. Let the nxd matrix A and the positive integer k < min{r7,, d} be the inputs of the k-means clustering 
problem. Let e € (0, 1/3) and assume access to a ^-approximation k-means algorithm. Run Algorithm\l\with inputs 
A, k, e, and the ^-approximation algorithm in order to construct an indicator matrix X=j. Then with probability at 
least 0.97 — 5^, 

\\A - X^X]^A\\l < (1 + (1 + £)7) \\A - XoptXjptA\f^ . (4) 



Proof of Theorem [T] 

The proof of Theorem [T] employs several results from |19| including Lemma 6, 8 and Corollary 11. We summarize 
these results in Lemma |2]below. Before employing Corollary 11, Lemma 6, and Lemma 8 from 1 19J we need to make 
sure that the matrix R constructed in Algorithm[T]is consistent with Definition 1 and Lemma 5 in |fT9i . Theorem 1.1 
of 1 1 1 immediately shows that the random sign matrix R of Algorithm[T]satisfies Definition 1 and Lemma 5 in ||T9l . 

Lemma 2. Assume that the matrix R is constructed by using Algorithm\l\with inputs A, k and e. 
7. Singular Values Preservation: For all i e [k] and w.p. at least 0.99, 

\l~a.,{V^R)\<e. 



2. Matrix Multiplication: For any two matrices S G 



ixd 



and T G 



pdx fc 



E 



\ST- srr'^tWI 



<^ii^iiFriiF- 



3. Moments: For any C G 



ixd. 



E 



Ic^IIfI = IIC|If««^^«'"[IIc^IIf] < 2||c||p/i. 



The first statement above assumes c being sufficiently large (see step 1 of Algorithm [TJ. We continue with several 
novel results of general interest. 



'Reading the input d x logd sign matrix requires O(dlogd) time. However, in our case we only consider multiplication with 
a random sign matrix, therefore we can avoid the preprocessing step by directly computing a random correspondence matrix as 



discussed in Il5i Preprocessing Section]. 
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Lemma 3. Under the same assumptions as in Lemma\2\and w.p. at least 0.99, 



< 3e 



(5) 



Proof. Let $ = R; note that $ is a fc x t matrix and the SVD of $ is <f> = where [/$ and E$ arekx k 

matrices, and V$ is a t x fc matrix. By taking the SVD of (Vj7i?) and (Vj?^i?)^ we get 

{V/J R)^ — {VfJ R)^ ^ — \\V^'S^^Uj — V^'E^U^W^ — ||V$(S^^ — S$)?7,jr||2 ~ 11^*^ ^^^IL' 

since V$ and can be dropped without changing any unitarily invariant norm. Let ^ — ^<j»^ — ^<]>5 ^I/ is a /c x k 
diagonal matrix. Assuming that, for all i e [k], o'i(<i>) and Ti(^') denote the i-th largest singular value of $ and the 
i-th diagonal element of ^P, respectively, it is 



Since 5* is a diagonal matrix. 



^'11 2 = max Ti{^) = max 



l<i<k ' l<i<k 0-i($) 

The first statement of Lemma |2] our choice of e e (0, 1/3), and elementary calculations suffice to conclude the 
proof. □ 

Lemma 4. Under the same assumptions as in Lemma\2\and for any n x d matrix C w.p. at least 0.99, 

\\CR\\^ < VTl+^nip- (6) 

Proof. Notice that there exists a sufficiently large constant c such that t > ck/e"^. Then, setting Z — || Ci? ||p, using 
the third statement of Lemma|2] the fact that k > 1, and Chebyshev's inequality we get 



Z -E[Z]\>e\\C\\l) < 



^<^< A<o.oi. 

The last inequality follows assuming c sufficiently large. Finally, taking square root on both sides concludes the 
proof. □ 

Lemma 5. Under the same assumptions as in Lemma\2\and w.p. at least 0.97, 



Ak^{AR){V^R)Wj +E, 



(7) 



where E is an n x d matrix with \\E\\p < 4e — ^fc||p. 



Proof Since (Ai?)(VjTi?) V^J^ is an n x d matrix, let us write E = Ak ~ {AR){V,J R)W,J . Then, setting A 
Ak + Ap^k, and using the triangle inequality we get 



I^^IIf < 



Ak - AkR{V^R)W^ 



Ap^kR{VjR)W^ 



The first statement of Lemma|2]implies that rank( VjTi?) — k thus (VfJ R){V^^ Ry ~ 7^, where is the A: x fc identity 
matrix. Replacing A^ = UkEkVfl and setting {VfJ R){VfI R)^ = h we get that 



Ak - AkR{V^R)Wy 



Ak - Uk^kV^ R{V^ R)W^ 



= \\Ak-UkEkV^\l = 0. 



To bound the second term above, we drop VjJ, add and subtract the matrix Ap^kRiVfJ R)^ V/J , and use the triangle 
inequality and submultiplicativity: 



Ap.kRiV^Rfv^ 



< \\Ap^kR{VjRy 



Ap_kR{{V^R)^ - (VkJRy 



< \\Ap-kRR''Vk\\^ + \\Ap-kR\\, {VkJRY^iVjRy 
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Now we will bound each term individually. A crucial observation for bounding the first term is that Ap^kVk = 
J7p_fcl]p_fe VgLfcVfc = by orthogonality of the columns of Vk and Vp_fe. This term now can be bounded using the 
second statement of Lemma |2] with S = Ap-k and T = Vk- This statement, assuming c sufficiently large, and an 
application of Markov's inequality on the random variable \\Ap-kRR^Vk — Ap_fcVfc||p give that w.p. at least 0.99, 

\\Ap_kRR'^Vk\\^ < 0.5e\\Ap_k\\p- (8) 

The second two terms can be bounded using Lemma[3]and Lemma|4]on C = Ap^k- Hence by applying a union bound 
on Lemma[3j Lemma|4]and Inq. ([8]), we get that w.p. at least 0.97, 



|£^IIf < 



{Ap^kRR'^VkL + \\Ap-kR\\p (V^Ry ~ {V^ R) 



< 0.5£||Ap_fc||p + V(l + e)ll^p-fellF-3e 

< 0.5e||Ap_fe||p + 3.5£||Ap_fc||p 



4e' 



□ 



The last inequality holds thanks to our choice of e e (0, 1/3). 

Proposition 6. A well-known property connects the SVD of a matrix and k-means clustering. Recall Definition\l\ and 
notice that XoptXjp^A is a matrix of rank at most k. From the SVD optimality we immediately get that 

WAp^kWl = \\A-Ak\\l < \\A~X,ptX^p,A\^^. 



(9) 



4.1 The proof of Eqn. (H of Theorem[T] 



We start by manipulating the term — X-yX^74||p in Eqn. (|4|. Replacing A by Ak + Ap^k, and using the 

Pythagorean theorem (the subspaces spanned by the components Ak — X^jXj Ak and Ap^k — X-yXj Ap^k are 
perpendicular) we get 



\A-X^xJA\1 = - X^xJ)Ak\\^ + - X^xJ)Ap^k\ 



F ■ 



(10) 



We first bound the second term of Eqn. ( fTOb . Since / — X^Xj is a projector matrix, it can be dropped without 
increasing a unitarily invariant norm. Now Proposition|6]implies that 



< \\A 



< 



1^ — XoptXgp^A 



We now bound the first term of Eqn. ( flOb : 



h < 

< 



{I ~ X^xJ)AR{VkR)Wf^ 
{I ^ X^xJ)AR\lj{VkR)^ 
< ^\\{I-XoptXjp,)AR\l 



+ \\E\\ 



(VkR)^ 



\E\\ 



\E\\ 



< ^{l + 2.5e)\\{I-XoptXjp,)A\l +V7 4£||(/-XoptXT,)A||p 

< ^il + 6.5e)\\{I-X,ptXjpt)A\l 



(11) 

(12) 
(13) 
(14) 

(15) 

(16) 
(17) 



In Eqn. (fTZt we used Lemma|5] the triangle inequality, and the fact that / — X^Xj is a projector matrix and can be 
dropped without increasing a unitarily invariant norm. In Eqn. (fT3] i we used submultiplicativity (see Section |Z2] | and 
the fact that VfJ can be dropped without changing the spectral norm. In Eqn. (fT4l i we replaced Xj by Xopt and the 
factor ^/j appeared in the first term. To better understand this step, notice that X^ gives a 7-approximation to the 
optimal fc-means clustering of the matrix AR, and any other n x k indicator matrix (for example, the matrix Xopt) 
satisfies 

"(l-X^Xj)AR\\l< 7 mm\\{I-XX^)AR\\l < - XopiXj^^) AR\\1 . 



XGA" ' 
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Figure 1 : The results of our experiments after running Algorithm 1 with fc = 40 on the face images collection. 



In Eqn. ( fTSl ) we used Lemma|4]with C = (/ — XoptXjpf)A, Lemma[3]and Proposition |6l In Eqn. ( fTSI l we used the 
fact that 7 > 1 and that for any e e (0,1/3) it is + e)/(l — e) < 1 + 2.5e. Taking squares in Eqn. ( fTTT i we get 

el < ^{l + 28e)\\{I-X,ptXjp,)A\\l. 
Finally, rescaling e accordingly and applying the union bound on Lemma|5]and Definition |2] concludes the proof. 

5 Experiments 

This section describes an empirical evaluation of Algorithm [T] on a face images collection. We implemented our 
algorithm in MatLab and compared it against other prominent dimensionality reduction techniques such as the Local 
Linear Embedding (LLE) algorithm and the Laplacian scores for feature selection. We ran all the experiments on a 
Mac machine with a dual core 2.26 Ghz processor and 4 GB of RAM. Our empirical findings are very promising 
indicating that our algorithm and implementation could be very useful in real applications involving clustering of 
large-scale data. 

5.1 An application of Algorithm 1 on a face images collection 

We experiment with a face images collection. We downloaded the images corresponding to the ORL database from 
||2T|. This collection contains 400 face images of dimensions 64 x 64 corresponding to 40 different people. These 
images form 40 groups each one containing exactly 10 different images of the same person. After vectorizing each 
2-D image and putting it as a row vector in an appropriate matrix, one can construct a 400 x 4096 image-by-pixel 
matrix A. In this matrix, objects are the face images of the ORL collection while features are the pixel values of the 
images. To apply the Lloyd's heuristic on A, we employ MatLab's function kmeans with the parameter determining 
the maximum number of repetitions setting to 30. We also chose a deterministic initialization of the Lloyd's iterative 
E-M procedure, i.e. whenever we call kmeans with inputs a matrix A E with d> 1, and the integer k ~ 40, 

we initialize the cluster centers with the 1-st, 11-th,..., 391-th rows of A, respectively. Note that this initialization 
corresponds to picking images from the forty different groups of the available collection, since the images of every 
group are stored sequentially in A. We evaluate the clustering outcome from two different perspectives. First, we 
measure and report the objective function F of the /c-means clustering problem. In particular, we report a normalized 
version of F, i.e. F — Second, we report the mis-classification accuracy of the clustering result. We 

denote this number by P (0 < P < 1), where P = 0.9, for example, implies that 90% of the objects were assigned 
to the correct cluster after the application of the clustering algorithm. In the sequel, we first perform experiments by 
running Algorithm [T] with everything fixed but t, which denotes the dimensionality of the projected data. Then, for 
four representative values of t, we compare Algorithm [T] with three other dimensionality reduction methods as well 
with the approach of running the Lloyd's heuristic on the original high dimensional data. 

We run Algorithm[T]with t — 5, 10, 300 and fc = 40 on the matrix A described above. Figure [T] depicts the results 
of our experiments. A few interesting observations are immediate. First, the normalized objective function P is a 
piece-wise non-increasing function of the number of dimensions t. The decrease in P is large in the first few choices 
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t=10 


t = 20 


t = 50 


t= 100 




P 


F 


P 


F 


P 


F 


p 


F 


SVD 


0.5900 


0.0262 


0.6750 


0.0268 


0.7650 


0.0269 


0.6500 


0.0324 


LLE 


0.6500 


0.0245 


0.7125 


0.0247 


0.7725 


0.0258 


0.6150 


0.0337 


LS 


0.3400 


0.0380 


0.3875 


0.0362 


0.4575 


0.0319 


0.4850 


0.0278 


HD 


0.6255 


0.0220 


0.6255 


0.0220 


0.6255 


0.0220 


0.6255 


0.0220 


RP 


0.4225 


0.0283 


0.4800 


0.0255 


0.6425 


0.0234 


0.6575 


0.0219 



Table 2: Numerics from our experiments with five different methods. 



of t\ then, increasing the number of dimensions t of the projected data decreases F by a smaller value. The increase 
of t seems to become irrelevant after around i = 90 dimensions. Second, the mis-classification rate P is a piece-wise 
non-decreasing function of t. The increase of t seems to become irrelevant again after around t = 90 dimensions. 
Another interesting observation of these two plots is that the mis-classification rate is not directly relevant to the 
objective function F. Notice, for example, that the two have different behavior from < = 20 to t = 25 dimensions. 
Finally, we report the running time T of the algorithm which includes only the clustering step. Notice that the increase 
in the running time is - almost - linear with the increase of t. The non-linearities in the plot are due to the fact that 
the number of iterations that are necessary to guarantee convergence of the Lloyd's method are different for different 
values of t. This observation indicates that small values of t result to significant computational savings, especially 
when n is large. Compare, for example, the one second running time that is needed to solve the fc-means problem 
when t = 275 against the 10 seconds that are necessary to solve the problem on the high dimensional data. To our 
benefit, in this case, the multiplication AR takes only 0.1 seconds resulting to a total running time of 1.1 seconds 
which corresponds to an almost 90% speedup of the overall procedure. 

We now compare our algorithm against other dimensionality reduction techniques. In particular, in this paragraph 
we present head-to-head comparisons for the following five methods: (i) SVD: the Singular Value Decomposition 
(or Principal Components Analysis) dimensionality reduction approach - we use MatLab's svds function; (ii) LLE: 
the famous Local Linear Embedding algorithm of flSl - we use the MatLab code from |23| with the parameter K 
determining the number of neighbors setting equal to 40; (iii) LS: the Laplacian score feature selection method of ITOl 
- we use the MatLab code from |22| with the default parameter^ (v) HD: we run the /c-means algorithm on the High 
Dimensional data; and (vi) RP: the random projection method we proposed in this work - we use our own MatLab 
implementation. The results of our experiments QnA,k = 40 and t = 10, 20, 50, 100 are shown in Table|2] In terms of 
computational complexity, for example t ~ 50, the time (in seconds) needed for all five methods (only the dimension 
reduction step) are Tsvd = 5.9, Tlle = 4.4, Tls = 0.32, Trd = 0, and Trp = 0.03. Notice that our algorithm 
is much faster than the other approaches while achieving worse (t = 10, 20), slightly worse (t — 50) or slightly better 
it = 100) approximation accuracy results. 

5.2 A note on the mailman algorithm for matrix-matrix and matrix-vector multiplication 

In this section, we compare three different implementations of the third step of Algorithm[T] As we already discussed 
in Section 13.11 the mailman algorithm is asymptotically faster than naively multiplying the two matrices A and R. 
In this section we want to understand whether this asymptotic behavior of the mailman algorithm is indeed achieved 
in a practical implementation. We compare three different approaches for the implementation of the third step of 
our algorithm: the first is MatLab's function times{A, R) (MMl); the second exploits the fact that we do not need to 
explicitly store the whole matrix R, and that the computation can be performed on the fly (column-by-column) (MM2); 
the last is the mailman algorithm |15J (see Section [TTI for more details). We implemented the last two algorithms in 
C using MatLab's MEX technology. We observed that when A is a vector (n = 1), then the mailman algorithm 
is indeed faster than (MMl) and (MM2) as it is also observed in the numerical experiments of fTSl. Moreover, it's 
worth-noting that (MM2) is also superior compared to (MMl). On the other hand, our best implementation of the 
mailman algorithm for matrix-matrix operations is inferior to both (MMl) and (MM2) for any 10 < n < 10, 000. 
Based on these findings, we chose to use (MMl) for our experimental evaluations. 
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^In particular, we run W — constructW {A); Scores = LaplacianScore{A, W); 
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